Playing with distances: Document Similarity

Authors

Harsh Thakkar

Ganesh Iyer

Honey Patel

Kesha Shah

Abstract

Spoken information retrieval is a promising domain of research. In this paper, we describe our participation in the pilot task Document Similarity Amid Automatically Detected Terms at FIRE 2014. We present the findings of our experiments with variants of distance-based and timestamp-based approaches. The de-normalized distance-based variant outperformed the other two, delivering the best results among the submitted runs. However, there is scope for further improvement in the results.

Similar resources

Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation

Purpose: This study investigates the automatic keyword extraction from the table of contents of Persian e-books in the field of science using LDA topic modeling, evaluating their similarity with golden standard, and users' viewpoints of the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...

Document clustering based on ontology and a fuzzy approach

Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, the unsupervised classification of similar documents into different groups, is one of the important techniques of text mining. The most important step...

The Quadratic-Chi Histogram Distance Family - Appendices

This document contains the appendices for the paper “The Quadratic-Chi Histogram Distance Family” [1]: proofs and additional results. In section 2 we prove that all Quadratic-Chi histogram distances are continuous. In section 3 we prove that EMD, ÊMD and all Quadratic-Chi histogram distances are Similarity-Matrix-Quantization-Invariant. In section 4 we present additional shape classification res...

Enhancement of Search Results Using Dynamic Document Seed Reranking Algorithm

We proposed an algorithm to improve the precision of top retrieved documents by reordering the retrieved documents in the initial retrieval. To re-order the documents, we first automatically extract key terms and key phrases from top N retrieved documents and generate a document index for each document. Using the standard similarity metrics, a document similarity matrix is generated for these d...

The role of semantic relations in improving the results of a citation recommendation system (selected paper of the 17th National Conference of the Computer Society of Iran)

With the increasing growth of scientific documents on the Web, it is difficult to select a relevant document. A citation recommendation system receives a text and recommends documents to be cited by the text. Such recommendations help a researcher find relevant texts. Based on semantic relations, this paper presents a new indicator to measure the similarity between documents an...

Publication date: 2014
